Data analysis is often an afterthought, but shouldn't be!
2015-02-09
After this class students will be able to:
ggplot2
Visualization is important both for
Models and tables are fine, but I often find that visualizations are more helpful for understanding what is going on (plus they make your presentations look WAY cooler).
4 data sets with the same regression results: \(y = 3 + 0.5 x\)
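The claim can be checked directly in R. The sketch below assumes the four data sets are Anscombe's quartet, which ships with base R as the `anscombe` data frame; the slide does not name the data, so treat that identification as an assumption.

```r
# Assumption: the "4 data sets" are Anscombe's quartet (built-in `anscombe`).
# Fit y ~ x separately for each of the four x/y pairs.
fits <- lapply(1:4, function(i) {
  y <- anscombe[[paste0("y", i)]]
  x <- anscombe[[paste0("x", i)]]
  lm(y ~ x)
})

# Collect intercepts and slopes into one small table, one row per data set.
coefs <- t(sapply(fits, coef))
colnames(coefs) <- c("intercept", "slope")
round(coefs, 2)
```

Every row rounds to an intercept of 3 and a slope of 0.5, even though scatterplots of the four pairs look nothing alike, which is why the data should be plotted before the model is trusted.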
Example from Tufte (1997)
By Healy
By Kenworthy
By Jackman
Next example from the wonderful paper:
Thank you to Kieran Healy for being open about this so that we can all learn from it
corrplot(c.mat, method="shade", shade.col=NA, tl.col="black",
order="hclust", hclust.method="ward", tl.srt=45)
corrplot(c.mat,add=TRUE, type="lower", method="number",
order="AOE", diag=FALSE, tl.pos="n", cl.pos="n")
How could you write this differently?
Don't repeat yourself
data.to.plot <- c.mat
kHclustMethod <- "ward"
kOrder <- "hclust"
corrplot(data.to.plot, method = "shade", shade.col = NA, tl.col = "black",
order = kOrder, hclust.method = kHclustMethod,
tl.srt = 45)
corrplot(data.to.plot, add = TRUE, type = "lower", method = "number",
order = kOrder, hclust.method = kHclustMethod,
diag = FALSE, tl.pos = "n", cl.pos = "n")
Why ggplot2?
dplyr and the tidy data philosophy
GGally and ggmap
ggvis, the next big thing
library(ggmap)
library(ggplot2)
# Download a base map of New York City
NYC <- get_map(location = "new york, new york", zoom = 11)
p <- ggmap(NYC)
# Overlay one point per 311 call, using the call's coordinates
p + geom_point(data = dfcalls_small, aes(x = Longitude, y = Latitude)) +
  labs(title = "311 calls in NYC, 1/28/15 - 1/29/15") +
  theme(plot.title = element_text(size = rel(2)))
suppressPackageStartupMessages(library(dplyr))
library(ggplot2)
packageVersion("ggplot2")
## [1] '1.0.1'
More on ggplot2 version 1.0
Next, an example based on Dawn Koffman's OPR workshop
Basic components:
world.pop.data <- read.csv("data/wdata.csv", header = TRUE, sep = ",")
world.pop.data <- tbl_df(world.pop.data)
glimpse(world.pop.data)
## Observations: 158
## Variables:
## $ country (fctr) Algeria, Egypt, Libya, Morocco, South Sudan, Sudan, T...
## $ pop2012 (dbl) 37.4, 82.3, 6.5, 32.6, 9.4, 33.5, 10.8, 9.4, 17.5, 20....
## $ imr     (int) 24, 24, 14, 30, 101, 67, 20, 81, 65, 73, 70, 47, 89, 1...
## $ tfr     (dbl) 2.9, 2.9, 2.6, 2.3, 5.4, 4.2, 2.1, 5.4, 6.0, 4.6, 4.9,...
## $ le      (int) 73, 72, 75, 72, 52, 60, 75, 56, 55, 55, 58, 64, 54, 48...
## $ leM     (int) 72, 70, 72, 70, 50, 58, 73, 54, 54, 54, 57, 63, 52, 47...
## $ leF     (int) 75, 74, 77, 74, 53, 62, 77, 58, 56, 56, 59, 65, 55, 50...
## $ region  (fctr) Northern Africa, Northern Africa, Northern Africa, No...
## $ area    (fctr) Africa, Africa, Africa, Africa, Africa, Africa, Afric...
world.pop.data
## Source: local data frame [158 x 9]
##
##          country pop2012 imr tfr le leM leF          region   area
## 1        Algeria    37.4  24 2.9 73  72  75 Northern Africa Africa
## 2          Egypt    82.3  24 2.9 72  70  74 Northern Africa Africa
## 3          Libya     6.5  14 2.6 75  72  77 Northern Africa Africa
## 4        Morocco    32.6  30 2.3 72  70  74 Northern Africa Africa
## 5    South Sudan     9.4 101 5.4 52  50  53 Northern Africa Africa
## 6          Sudan    33.5  67 4.2 60  58  62 Northern Africa Africa
## 7        Tunisia    10.8  20 2.1 75  73  77 Northern Africa Africa
## 8          Benin     9.4  81 5.4 56  54  58  Western Africa Africa
## 9   Burkina Faso    17.5  65 6.0 55  54  56  Western Africa Africa
## 10 Cote d'Ivoire    20.6  73 4.6 55  54  56  Western Africa Africa
## ..           ...     ... ... ... .. ... ... ...             ...    ...
All together, the layered grammar defines a plot as the combination of:
p <- ggplot(data = world.pop.data,
aes(x = le, y = tfr))
p + layer(geom = "point")
p <- ggplot(data = world.pop.data,
aes(x = le, y = tfr, color = area))
p + layer(geom = "point")
p <- ggplot(data = world.pop.data,
aes(x = le, y = tfr, color = area, size = pop2012))
p + layer(geom = "point")
p <- ggplot(data = world.pop.data,
aes(x = tfr, y = le, color = area, size = pop2012))
p + layer(geom = "point")
p <- ggplot(data = world.pop.data,
aes(x = le, y = tfr))
p + layer(geom = "point")
p <- ggplot(data = world.pop.data,
aes(x = le, y = tfr))
p + layer(geom = "line")
p <- ggplot(data = world.pop.data,
aes(x = le, y = tfr))
p + layer(geom = "blank")
p <- ggplot(data = world.pop.data,
aes(x = le, y = tfr))
p + layer(geom = "point")
p <- ggplot(data = world.pop.data,
aes(x = le, y = tfr))
p + layer(geom = "point") + layer(stat = "smooth")
p <- ggplot(data = world.pop.data,
aes(x = le, y = tfr))
p + layer(geom = "point") + layer(stat = "smooth", method = "loess")
p <- ggplot(data = world.pop.data,
aes(x = le, y = tfr))
p + layer(geom = "point") + layer(stat = "smooth", method = "lm")
p <- ggplot(data = world.pop.data,
aes(x = le, y = tfr))
p + layer(geom = "point") + layer(stat = "smooth", method = "lm", se = FALSE)
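The bare `layer(geom = "point")` calls above match the ggplot2 1.0.1 shown earlier. From ggplot2 2.0.0 onward, `layer()` expects `geom`, `stat`, and `position` to be given explicitly, with extra arguments passed through `params`. The sketch below shows a spelled-out layer next to the `geom_*()` shortcut. The data frame `pop.sample` is a hypothetical stand-in holding only the `le` and `tfr` values of the ten rows printed above, so the block runs without `data/wdata.csv`.

```r
library(ggplot2)

# Hypothetical stand-in for world.pop.data: le and tfr for the ten
# countries shown in the printed output above.
pop.sample <- data.frame(
  le  = c(73, 72, 75, 72, 52, 60, 75, 56, 55, 55),
  tfr = c(2.9, 2.9, 2.6, 2.3, 5.4, 4.2, 2.1, 5.4, 6.0, 4.6)
)

p <- ggplot(data = pop.sample, aes(x = le, y = tfr))

# Fully explicit layers: each one names its geom, stat, and position.
p.layer <- p +
  layer(geom = "point", stat = "identity", position = "identity") +
  layer(geom = "smooth", stat = "smooth", position = "identity",
        params = list(method = "lm", se = FALSE))

# The same plot written with the geom_*() shortcuts, which fill in
# the default stat and position for each geom.
p.short <- p +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```

Printing `p.layer` or `p.short` should draw the same scatterplot with a straight fitted line. The explicit form is mainly useful for seeing what the grammar is doing; the shortcuts are what most code uses day to day.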
p <- ggplot(data = world.pop.data, aes(x = le, y = tfr))
p <- ggplot(data = world.pop.data, aes(x = le, y = tfr))
p + layer(geom = "point") + facet_grid(area ~ .)
p <- ggplot(data = world.pop.data, aes(x = le, y = tfr))
p + layer(geom = "point") + facet_grid(. ~ area)
p <- ggplot(data = world.pop.data, aes(x = le, y = tfr))
p + layer(geom = "point") + facet_grid(. ~ area) + layer(stat = "smooth", method = "loess")
Questions about visualization and ggplot2?
One thing that you should know that most people don't talk about:
Goal check
Review and motivation for next class
ggplot2 is confusing at first
It just takes practice!